Chromatin immunoprecipitation experiments followed by sequencing (ChIP–seq) detect protein–DNA binding events and chemical modifications of histone proteins (Figure 1).
Comparison of ChIP-seq experimental protocols (adapted from Furey 2012, Nature Reviews).
Profiles obtained from transcription factor (TF) ChIP-seq and histone ChIP-seq are different (Figure 2):
The transcription factor and histone modification landscape for inactive and active cis-regulatory elements (CRE; promoters and enhancers) and the corresponding read profiles (from Pundhir et al. 2015, Frontiers in Genetics).
When sequencing ChIP-seq samples, we usually need a control for the antibody inespecific binding, known as input. We can use as input bulk DNA or DNA immunoprecipitated with IgG.
You have to consider that when sequencing ChIP-seq, many of the parameters you can choose depend on the specifics of your experiment, your question and your budget! However, to provide you with some general guidelines, this are some general recommendations:
The datasets we are going to use for this workshop are from Hlady et al., 2019 Hepatology, in which they compare several assays in normal liver with cirrotic liver and hepatocellular carcinoma (HCC) samples from human donors.
All raw data from their assays can be accessed via their GEO ID: GSE112221. The Gene Expression Omnibus (GEO) is a public functional genomics data repository that accepts array- and sequence-based data. Each dataset is given a GEO ID that usually starts with GSE. Inside each dataset, you can find independent samples with identifiers prefixed GSM.
In GEO you can download already processed data, such as bed files. However, we here are interested in downloading the raw fastq files so we can do all the processing ourselves. Therefore, for each sample we are interested in, we need to find the SRA ID (SRX, you can find it in the sample page in GEO) and from there, gather the run ID (SRR) which will give us access to the raw data.
+ More.You can download the files directly from the SRA website by clicking on the run ID, but you can also do so from the terminal using the SRA toolkit tool fastq-dump.
With fastq-dump you will be able to download fastq files directly from SRA. Since these samples are paired-end, you need to include the argument --split-3 to send each pair to a different file. --gzip will compress the fastq output to fastq.gz.
# Download fastq files from SRA database
fastq-dump --split-3 --gzip -O your_folder/ SRR6880492
Here you have a complete list of the samples we are going to use in this workshop and their SRA run IDs:
| Sample names | SRA run ID | Sequenced reads | Description |
|---|---|---|---|
| NL1_h3k27ac | SRR6880492 | 19,808,136 | H3K27ac ChIP-seq in Normal Liver |
| Cirr1_h3k27ac | SRR6880493 | 26,149,171 | H3K27ac ChIP-seq in Cirrotic Liver (Patient 1) |
| HCC1_h3k27ac | SRR6880494 | 18,789,828 | H3K27ac ChIP-seq in Hepatocellular Carcinoma (Patient 1) |
| Cirr3_h3k27ac | SRR6880495 | 25,896,625 | H3K27ac ChIP-seq in Cirrotic Liver (Patient 3) |
| HCC3_h3k27ac | SRR6880496 | 21,954,923 | H3K27ac ChIP-seq in Hepatocellular Carcinoma (Patient 3) |
| NL1_input | SRR6880507 | 35,050,237 | Input ChIP-seq in Normal Liver |
| Cirr1_input | SRR6880509 | 24,301,177 | Input ChIP-seq in Cirrotic Liver (Patient 1) |
| HCC1_input | SRR6880511 | 27,006,532 | Input ChIP-seq in Hepatocellular Carcinoma (Patient 1) |
| Cirr3_input | SRR6880513 | 36,198,135 | Input ChIP-seq in Cirrotic Liver (Patient 3) |
| HCC3_input | SRR6880515 | 14,131,132 | Input ChIP-seq in Hepatocellular Carcinoma (Patient 3) |
A general workflow for ChIP-seq analysis consists on the following steps:
Before starting any processing steps, one should check the overall quality of the reads obtained from the sequencing facility. To do so, we are going to use FastQC, a quality control tool for high-troughput sequencing data.
fastqc folder_with_fastq/sample_reads.fastq
After checking that the overall quality is good, one can trim reads and/or filter out reads with low overall quality using different tools1.
fastqc for the control liver sample.
fastqc fastq/NL1_H3K27ac.fastq
We call alignment (or mapping) to the process of comparing reads to a reference genome, scoring the similarity and then assigning most likely genomic coordinates to each read.
A reference genome is a database of DNA sequences, assembled as a representative of a species genome. Usually, one can find two versions of the reference genome: NCBI (starts with GRC) and UCSC (hg for human, mm for mice). The difference between the two basically lies in the naming of the chromosomes: NCBI uses numbers (i.e. 1, 2, X) and UCSC numbers plus chr (i.e. chr1, chr2, chrX).
Nowadays, the most used versions of the human genome are the following:
There are tools to convert coordinates from one build to another, such as liftOver from UCSC.
Aligners are tools specifically designed to map reads to a reference genome using different algorithms. Many aligners are available, but the most used for aligning ChIP-seq data are:
Both tools allow alignment of single and paired end reads, you just have to specify it when running them. Generating an index is a techinque that many aligners use to speed up their running type. You may think about it as a book index: when having an idex it is then easier and faster to find specific parts of the book (i.e. the reference genome). Paper describing indexing methods
The output of the aligner is a SAM/BAM file containing the read sequence and the assigned coordinates respect to the reference genome. The aligner will report the number of aligned reads (which should be >80% of the sequenced reads) and which of those where uniquely mapped or multi-mapping (they match more than one region in the genome). Aligners such as bowtie2 report the best alginment for multi-mapping reads.
SAM/BAM filesSAM (Sequence Alignment/Map) and BAM (same as SAM but binary, compressed and indexed) are file types used for storing sequence and alignment information.
Can you tell which one of the folloing files is in SAM and which one in BAM format?
File 1:
�BC�W[�Y�gbb&J��“����έ��q�鮙4۷t�
��P�m&=�m�z���� >+���ºB\(���|}�U�}[��U�._4�����T����������������/�WN?p�Bem��l�n��g��`4�.�'�����z������ �t��� !�I�\)��’1�M#0��@
3�>�”�0��/æ�R��P�T�D ��L&���#�1BM������P2��p��`I$�I}�M�D^t�P�h2f�cFɠ�ό�
File 2:
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
BAM/SAM files consist on two different parts:
@XX and contains information on the file itself: how it is ordered, which is the reference genome used, which program you used to align the reads, etc.You can find more information on all the fields present in the SAM/BAM files from the Samtools documentation.
Can you tell which lines are the header and which lines are the body of this SAM file?
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
@HD and @SQ are the header.
HD) is telling you the format version (VN:1.6) and that the file is sorted by coordinates (SO:coordinate).SQ) contains information on the reference genome used for the alignment: the name of the reference genome (SN:ref) and it length (LN:45).Before running an alignment, you will have to download the reference genome you want to use and index it.
For downloading hg19 you can go to UCSC downloads > Human > hg19 > Full data set > hg19.2bit. 2bit is a compression that UCSC uses for making fasta files smaller. You can convert a 2bit file into a fasta file by running twoBitToFasta from UCSC tools.
twoBitToFastq hg19.2bit hg19.fa # Uncompress the file
bowtie2-build hg19.fa hg19 --threads 5 # Create index with Bowtie2
We got our index and we got our reads, so let’s run the alignment! You can specify different parameters for the alignment (check bowtie2 -h to see them all), we are going to use the following:
-t: Print the time it takes to perform each step.-x: Index filename prefix (without the trailing .X.bt2).-U: Single-end reads file (if it were paired end you need to specify the first pair file with -1 and the second pair with -2). Fastq files can be gzipped (extension: .gz), so no need to unzip before aligning!-S: Specify file for SAM output.bowtie2 -t -x index -U single_end_reads.fastq -S output.sam
Once your alignment is done, you usually will have to perform some routine steps to process the the output:
SAM file. You should convert it to BAM, which is a more light-weight way of storing the same data.BAM file by genomic coordinates.BAM file: duplicates, reads mapping to ENCODE black-listed regions, chromosomes you are not interested in, etc.BAM file. As with reference genomes, you can create a index for your BAM file so operations requiring accessing parts of it will run much faster.Usually, for running operations on BAM or SAM files, the preferred program is Samtools.
SAM to BAMThis first step can be easily done using samtools view, with the following arguments:
-@ 4: Indicates the number (4) of additional threads to use (default is 0).-b: Tells samtools to convert to BAM.-o: Output filename for the BAM file.samtools view data/ChIP-seq/BAM/file.sam -@ 4 -b -o data/ChIP-seq/BAM/file.raw.bam
BAM file by coordinateMost programs need an input BAM file in which reads are sorted according to their genomic position. To order our BAM file we can use samtools sort with the following parameters:
-o: Output filename for the sorted BAM file.-m 2G: Tells Samtools to use up to 2G per thread.-@ 4: Uses 4 __additional__threads.samtools sort data/ChIP-seq/BAM/file.raw.bam -o data/ChIP-seq/BAM/file.srtd.bam -m 2G -@ 4
Do you remember bash pipes? You can use them with samtools too! You can pipe the output of samtools view (a BAM file) to samtools sort adding - , without having to create any intermediate files:
samtools view data/ChIP-seq/BAM/file.sam -b | samtools sort - -o data/ChIP-seq/BAM/file.srtd.bam
-r: Remove duplicate reads.-@ 4: Use 4 additional threads.samtools markdup data/ChIP-seq/BAM/file.srtd.bam data/ChIP-seq/BAM/file.rmdup.bam -r -@ 4
samtools index file.rmdup.bam